Part I: Linear vs. Nonlinear (Greedy) Approximations

1. Legendre polynomials

Task: plot the first five (modified) Legendre polynomials. Verify, numerically, that they are orthonormal.

The Legendre polynomials provide a countable basis for the Hilbert space of square-integrable functions on [-1, +1]:

\[P_0(x) = 1,\quad P_1(x) = x,\quad P_2(x) = \frac{1}{2}(3x^2-1),\ \ldots,\quad P_j(x) = \frac{1}{2^j j!}\frac{d^j}{dx^j}(x^2-1)^j,\ \ldots\]

Since computing the \(j\)-th derivative of \((x^2-1)^j\) directly would yield a very long expression, it is more convenient to evaluate these polynomials through the recurrence relation:

\[P_{j+1}(x) = \frac{(2j + 1) x P_j (x) - j P_{j-1}(x)}{j + 1}\]

legendre = function(x, j) {
  if (j == 0) {
    return(rep(1, length(x)))  # vectorized, so integrate() always gets a vector back
  }
  if (j == 1) {
    return(x)
  }
  ((2*j - 1)*x*legendre(x, j - 1) - (j - 1)*legendre(x, j - 2))/j
}

These polynomials are orthogonal…

\[\int_{-1}^{+1}P_{j_1}(x)P_{j_2}(x)\,dx = 0 \quad \forall \: j_1 \neq j_2\]

a = integrate(function(x) legendre(x,5)*legendre(x,3), lower=-1, upper=1)$value
b = integrate(function(x) legendre(x,4)*legendre(x,2), lower=-1, upper=1)$value
e = integrate(function(x) legendre(x,3)*legendre(x,1), lower=-1, upper=1)$value
c(a, b, e)  # each is zero up to numerical error (order 1e-16)

…but not normal, since:

\[\int_{-1}^{+1}P^2_j(x)dx =\frac{2}{2j+1}\]

Actually:

\(\int_{-1}^{+1}P^2_1(x)\,dx =\) 0.6666667, \(\int_{-1}^{+1}P^2_2(x)\,dx =\) 0.4, \(\int_{-1}^{+1}P^2_3(x)\,dx =\) 0.2857143, \(\int_{-1}^{+1}P^2_4(x)\,dx =\) 0.2222222

However, we can normalize Legendre polynomials:

\[Q_j(x) = \sqrt{\frac{(2j + 1)}{2}} P_j (x)\]

q_x=function(x,j) {
  return(legendre(x,j)*sqrt((2*j+1)/2))
}

… and these \(Q_j\) realize an orthonormal basis for \(L_2([-1, +1])\), where \(L_2\) is defined as: \[L_2([a, b]) = \{g : [a, b] \to \mathbb{R} \ \text{s.t.}\ \left \| g \right \|_2^2 = \int_a^b g(x)^2\, dx < \infty\}\]

indeed:

\(\int_{-1}^{+1}Q^2_1(x)\,dx =\) 1, \(\int_{-1}^{+1}Q^2_2(x)\,dx =\) 1, \(\int_{-1}^{+1}Q_1(x)Q_4(x)\,dx =\) 5.32524e-17

Everything seems fine… except for one tiny detail: as written, it is not going to scale! The recursive function defined above makes two recursive calls per level (like the naive Fibonacci recursion), so the number of evaluations grows exponentially in \(j\). For the sake of our PC we should parallelize the calculations across threads/cores, cache (memoise) the function… or just use the gsl library:

library(gsl)
leg.basis2 <- function(j,x) sqrt((2*j + 1)/2)*legendre_Pl(j, x)
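If you prefer to stay in base R, the exponential blow-up also disappears once the recurrence is filled in iteratively, reusing the two previous polynomials in a single pass; a minimal sketch (the function name is ours):

```r
# Iterative evaluation of P_0 .. P_J on a vector x: O(J) work per point,
# instead of the exponentially many calls of the naive recursion.
legendre_upto <- function(x, J) {
  P <- matrix(0, length(x), J + 1)
  P[, 1] <- 1                        # P_0(x) = 1
  if (J >= 1) P[, 2] <- x            # P_1(x) = x
  if (J >= 2) for (j in 1:(J - 1))   # P_{j+1} from the recurrence
    P[, j + 2] <- ((2*j + 1) * x * P[, j + 1] - j * P[, j]) / (j + 1)
  P
}
```

Column \(j+1\) then holds \(P_j\): for example `legendre_upto(0.5, 3)[, 3]` is \(P_2(0.5) = -0.125\).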

…The basis is ready; now let’s set up the target function: the Doppler function rescaled to [-1, 1]:

\[g(x) = \sqrt{x(1-x)}\,\sin\!\left(\frac{2.1\pi}{x+0.05}\right) \qquad h(x) = g(0.5(x+1))\]
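In code, the target and its rescaling can be written as (function names are ours):

```r
# Doppler function on [0, 1] and its rescaling h to [-1, 1]
g <- function(x) sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
h <- function(x) g(0.5 * (x + 1))
curve(h, from = -1, to = 1, n = 2000, main = "Rescaled Doppler")
```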

Fourier Coefficients

Evaluate the Fourier coefficients of the rescaled Doppler \(h\) under our normalized Legendre basis (\(\phi_j = Q_j\)):

\[\beta_{j}=\left \langle h,\phi_j \right \rangle_{L_2}= \int_{-1}^{+1}h(x)\phi_j(x)\,dx\]
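A self-contained sketch of this computation with `integrate()` (helper names are ours; the loop in `Qj` evaluates the same recurrence as before, just iteratively, for speed):

```r
# Sketch: first 10 Fourier coefficients of the rescaled Doppler h
# under the normalized Legendre basis Q_j.
Qj <- function(x, j) {
  P0 <- rep(1, length(x)); if (j == 0) return(sqrt(1/2) * P0)
  P1 <- x
  if (j >= 2) for (k in 1:(j - 1)) {   # iterate the three-term recurrence
    P2 <- ((2*k + 1) * x * P1 - k * P0) / (k + 1)
    P0 <- P1; P1 <- P2
  }
  sqrt((2*j + 1) / 2) * P1             # normalization constant
}
g <- function(x) sqrt(x * (1 - x)) * sin(2.1 * pi / (x + 0.05))
h <- function(x) g(0.5 * (x + 1))      # Doppler rescaled to [-1, 1]
beta <- sapply(0:9, function(j)
  integrate(function(x) h(x) * Qj(x, j), -1, 1, subdivisions = 1000L)$value)
round(beta, 6)
```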

First 10 coefficients:

 [1]  0.0684015246  0.1007691143  0.0643306763 -0.0424656443 -0.1369864121
 [6] -0.0949871788  0.0650198686  0.1341083357  0.0008080226 -0.1207487365
\newpage

Linear approximation

Approximations for different numbers of terms, J = {5, 10, 25, 50, 100, 150}, as follows:

\[h_{J}(x)=\sum_{j=0}^{J-1}\beta_j\phi_j(x)\]

and evaluating the error:

\[Error= \left \| h(x)-h_{J}(x) \right \|^2_{L_2}= \int_{-1}^{+1}(h(x)-h_{J}(x))^2dx\]
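The partial sums and their squared \(L_2\) errors can be computed along these lines; a sketch with our own helper names, using a single \(J = 25\) to keep the run short (definitions are repeated so the chunk is self-contained):

```r
# Qmat(x, J): matrix whose columns are Q_0 .. Q_{J-1} evaluated at x
Qmat <- function(x, J) {
  P <- matrix(0, length(x), J)
  P[, 1] <- 1
  if (J > 1) P[, 2] <- x
  if (J > 2) for (j in 1:(J - 2))
    P[, j + 2] <- ((2*j + 1) * x * P[, j + 1] - j * P[, j]) / (j + 1)
  sweep(P, 2, sqrt((2 * (0:(J - 1)) + 1) / 2), `*`)   # normalize each column
}
h <- function(x) { u <- 0.5 * (x + 1); sqrt(u * (1 - u)) * sin(2.1 * pi / (u + 0.05)) }
J <- 25
beta <- sapply(0:(J - 1), function(j)
  integrate(function(x) h(x) * Qmat(x, J)[, j + 1], -1, 1, subdivisions = 1000L)$value)
hJ <- function(x) drop(Qmat(x, J) %*% beta)           # J-term partial sum
err <- integrate(function(x) (h(x) - hJ(x))^2, -1, 1, subdivisions = 1000L)$value
normh2 <- integrate(function(x) h(x)^2, -1, 1, subdivisions = 1000L)$value
c(error = err, norm2 = normh2)
```

By Bessel's inequality the error is always between 0 and \(\|h\|_2^2\), which makes a handy sanity check.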

\newpage

Non-Linear approximation

A further improvement comes from selecting the largest coefficients (in absolute value), so that only the most relevant basis functions are used:

\[h_{J}^*(x)=\sum_{j\in \Lambda_{J}}\beta_j\phi_j(x)\]

Just to check that everything is working, notice that on the first 200 terms the linear and nonlinear approximations recover the same error value: when every coefficient is kept, the two schemes coincide.
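Selecting the index set \(\Lambda_J\) of the \(J\) largest coefficients is a one-liner with `order()`. A sketch on simulated coefficients (the `beta` vector here is a stand-in, not the Doppler coefficients); by Parseval, the squared \(L_2\) errors are just the sums of the omitted \(\beta_j^2\):

```r
# Nonlinear (greedy) J-term approximation: keep the J coefficients
# that are largest in absolute value.
set.seed(42)
beta <- rnorm(200) / (1:200)          # stand-in decaying coefficients
J <- 25
Lambda <- order(abs(beta), decreasing = TRUE)[1:J]   # index set Lambda_J
beta.nl <- numeric(length(beta))
beta.nl[Lambda] <- beta[Lambda]       # zero out everything outside Lambda_J
# Squared L2 errors via Parseval:
err.lin <- sum(beta[-(1:J)]^2)        # first-J (linear) truncation
err.nl  <- sum(beta[-Lambda]^2)       # largest-J (nonlinear) truncation
c(linear = err.lin, nonlinear = err.nl)
```

The greedy choice is optimal among all J-term subsets, so `err.nl` can never exceed `err.lin`, and with J equal to the full length both errors are zero, which is exactly the sanity check above.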

\newpage

Part II: Tensor Product Models & 3D Plots

1. Building the Tensor Product Basis

Task: Define all the functions needed to build the approximant function, learning, along the way, how to numerically evaluate double integrals.

We can build the Tensor basis using the definition provided:

\[\phi_{j_1,j_2}(x_1,x_2)=\phi_{j_1}(x_1)\,\phi_{j_2}(x_2), \qquad j_1,j_2=0,1,\ldots\]

for the cosine basis defined in previous labs

\[\phi_{0,0}(x_1,x_2)=1 \qquad \phi_{j_1,j_2}(x_1,x_2)=\sqrt{2}\cos(j_1\pi x_1)\,\sqrt{2}\cos(j_2 \pi x_2) \quad (j_1,j_2>0)\]

tensor.cosbasis = function(x, j1,j2){
  (1*(j1 == 0) + sqrt(2)*cos(pi*j1*x[1])*(j1 > 0))*
    (1*(j2 == 0) + sqrt(2)*cos(pi*j2*x[2])*(j2 > 0))
}

We compared two double-integration libraries, cubature and R2Cuba, to speed up the forthcoming steps; the second performed better. With either we can check the normality and orthogonality of our basis:

library(cubature)
library(R2Cuba)
## Check normality
adaptIntegrate(function(x) tensor.cosbasis(x,1,2)^2, lowerLimit = c(0,0), 
               upperLimit = c(1,1),maxEval = 50000)
$integral
[1] 1

$error
[1] 1.110223e-16

$functionEvaluations
[1] 255

$returnCode
[1] 0
cuhre(2,1,function(x) tensor.cosbasis(x,1,2)^2,lower = c(0,0),upper = c(1,1))
Iteration 1:  65 integrand evaluations so far
[1] 1.00007 +- 1.02672      chisq 0 (0 df)
Iteration 2:  195 integrand evaluations so far
[1] 0.999991 +- 0.186743    chisq 5.24566e-09 (1 df)
Iteration 3:  325 integrand evaluations so far
[1] 0.999995 +- 0.0933761   chisq 5.35999e-09 (2 df)
Iteration 4:  455 integrand evaluations so far
[1] 1 +- 9.50535e-06    chisq 8.96398e-09 (3 df)
integral: 1 (+-9.5e-06)
nregions: 4; number of evaluations:  455; probability:  2.2572e-13 
## Check orthogonality: two distinct multi-indices must give 0
prod.base.base <- function(x, j1, j2) {
  tensor.cosbasis(x, j1[1], j1[2]) * tensor.cosbasis(x, j2[1], j2[2])
}

adaptIntegrate(function(x) prod.base.base(x, c(1,2), c(3,4)), lowerLimit = c(0,0),
               upperLimit = c(1,1), maxEval = 50000)$integral  # numerically zero
cuhre(2, 1, function(x) prod.base.base(x, c(1,2), c(3,4)),
      lower = c(0,0), upper = c(1,1))$value                    # numerically zero

Now that we built our basis we can compute the Fourier Coefficients defined as:

\[\beta_{j_1,j_2}=\left \langle g(x_1,x_2),\phi_{j_1,j_2}(x_1,x_2) \right \rangle_{L_2}= \int_{0}^{1}\int_{0}^{1}g(x_1,x_2)\,\phi_{j_1,j_2}(x_1,x_2)\,dx_1dx_2\]

g_x = function(x) {
  x[1] + cos(x[2])
}

# Integrand for the Fourier coefficients: target times basis function
prod.base.g <- function(x, j1, j2) g_x(x) * tensor.cosbasis(x, j1, j2)

j1.max <- 50
j2.max <- 50
f.coeff <- matrix(NA, j1.max, j2.max)

for (idx1 in 0:(j1.max-1)){
  for (idx2 in 0:(j2.max-1)){
    coeff = tryCatch(
      cuhre(2, 1, function(x) prod.base.g(x, idx1, idx2), lower = c(0,0), upper = c(1,1),
            flags = list(verbose = 0))$value,
      error = function(e) NA
    )
    f.coeff[idx1+1, idx2+1] = coeff
  }
}

It’s a long computation, so we ran it once and stored the coefficients in ‘fcoeff.txt’; you only need to read.table(‘fcoeff.txt’) to load them. Furthermore, thanks to our i7 Mac processor, we didn’t even need to cap the number of evaluations in the cuhre function :)

Now, to proceed, instead of evaluating the approximation for just three combinations of \(J_1\) and \(J_2\), we observed the behaviour of the risk over four blocks of cases: the first two with only a few coefficients, the next two with more coefficients, and then two extreme cases, the smallest and the largest budgets. Within each block we fixed the total number of coefficients and tried different \((j_1, j_2)\) combinations. What we observe is that the function is so smooth that, whatever pair you choose (homogeneous or not), the error decreases roughly in proportion to the approximation order. The minimum error, however, occurs around \(37 \times 37\) coefficients, so don’t go too far.

   j1 j2          err
1   1  5 8.338556e-02
2   2  4 1.315214e-03
3   3  3 1.498807e-03
4   4  2 1.441359e-03
5   5  1 1.944249e-02
6   2  6 1.234234e-03
7   3  5 1.257698e-03
8   4  4 3.012896e-04
9   5  3 4.848833e-04
10  6  2 1.304208e-03
11  3  7 1.221834e-03
12  4  6 2.200715e-04
13  5  5 2.437819e-04
14  6  4 1.633919e-04
15  7  3 3.476281e-04
16 20 40 1.830399e-06
17 30 30 7.114881e-07
18 40 20 9.250931e-07
19  1  1 1.025843e-01
20 50 50 1.004057e-04
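The `predict` function used in the plots below does not appear above; it is just the truncated tensor expansion \(\sum_{j_1<J_1}\sum_{j_2<J_2}\beta_{j_1,j_2}\phi_{j_1,j_2}(x)\). A self-contained reconstruction (our own names; nested 1-D `integrate` calls stand in for `cuhre`, and a small \(J\) keeps it fast):

```r
# Sketch of the tensor approximant behind `predict(x, J1, J2)`
tensor.cosbasis <- function(x, j1, j2) {
  (1 * (j1 == 0) + sqrt(2) * cos(pi * j1 * x[1]) * (j1 > 0)) *
  (1 * (j2 == 0) + sqrt(2) * cos(pi * j2 * x[2]) * (j2 > 0))
}
g_x <- function(x) x[1] + cos(x[2])
dbl.int <- function(f)                       # double integral via nested 1-D quadrature
  integrate(function(a) sapply(a, function(x1)
    integrate(function(x2) sapply(x2, function(b) f(c(x1, b))), 0, 1)$value),
    0, 1)$value
J1 <- 4; J2 <- 4
f.coeff <- outer(0:(J1 - 1), 0:(J2 - 1), Vectorize(function(j1, j2)
  dbl.int(function(x) g_x(x) * tensor.cosbasis(x, j1, j2))))
predict.tensor <- function(x, J1, J2) {      # our name for the lab's `predict`
  s <- 0
  for (j1 in 0:(J1 - 1)) for (j2 in 0:(J2 - 1))
    s <- s + f.coeff[j1 + 1, j2 + 1] * tensor.cosbasis(x, j1, j2)
  s
}
```

For instance, `f.coeff[1, 1]` recovers \(\beta_{0,0} = \int\!\!\int (x_1 + \cos x_2)\,dx_1 dx_2 = 1/2 + \sin(1)\), and even the tiny \(4 \times 4\) truncation already approximates \(g\) well at interior points.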

Approximation with \((J_1, J_2) = (5, 5)\)

library(plot3D)
library(rgl)
library(rglwidget)

#Define sequences of X and Y in order to evaluate the function
X <- seq(0, 1, length = 100)
Y <- X
count=length(X)
# Z is a matrix of the results of the function evaluated in X and Y 
Z=matrix(nrow=count,ncol=count)
for (i in 1:count) {
  for (k in 1:count) {
    Z[i,k]=g_x(c(X[i],Y[k]))
  }
}
Z[is.na(Z)] <- 1

# We open the 3d plot environment
open3d()
bg3d("white")
material3d(col = "black")
# Plot the function
persp3d(X,Y,Z,  col = "lightgrey",
        xlab = "X", ylab = "Y", zlab = "g(x)")




X1 <- seq(0, 1, length = 100)
Y1 <- X1
count2=length(X1)
Z1=matrix(nrow=count2,ncol=count2)
for (i in 1:count2) {
  for (k in 1:count2) {
           Z1[i,k]=predict(c(X1[i],Y1[k]),5,5)
           }
      }

bg3d("white")
nbcol = 100
color = rev(rainbow(nbcol, start = 0/6, end = 4/6))
zcol  = cut(Z1, nbcol)
persp3d(X1, Y1, Z1, col = color[zcol],
        xlab = "X", ylab = "Y", zlab = "prediction",add=T)


Approximation with \((J_1, J_2) = (7, 3)\)

library(plot3D)
library(rgl)
library(rglwidget)

#Define sequences of X and Y in order to evaluate the function
X <- seq(0, 1, length = 100)
Y <- X
count=length(X)
# Z is a matrix of the results of the function evaluated in X and Y 
Z=matrix(nrow=count,ncol=count)
for (i in 1:count) {
  for (k in 1:count) {
    Z[i,k]=g_x(c(X[i],Y[k]))
  }
}
Z[is.na(Z)] <- 1

# We open the 3d plot environment
open3d()
bg3d("white")
material3d(col = "black")
# Plot the function
persp3d(X,Y,Z,  col = "lightgrey",
        xlab = "X", ylab = "Y", zlab = "g(x)")




X1 <- seq(0, 1, length = 100)
Y1 <- X1
count2=length(X1)
Z1=matrix(nrow=count2,ncol=count2)
for (i in 1:count2) {
  for (k in 1:count2) {
           Z1[i,k]=predict(c(X1[i],Y1[k]),7,3)
           }
      }

bg3d("white")
nbcol = 100
color = rev(rainbow(nbcol, start = 0/6, end = 4/6))
zcol  = cut(Z1, nbcol)
persp3d(X1, Y1, Z1, col = color[zcol],
        xlab = "X", ylab = "Y", zlab = "prediction",add=T)


Approximation with \((J_1, J_2) = (37, 37)\)

library(plot3D)
library(rgl)
library(rglwidget)

#Define sequences of X and Y in order to evaluate the function
X <- seq(0, 1, length = 100)
Y <- X
count=length(X)
# Z is a matrix of the results of the function evaluated in X and Y 
Z=matrix(nrow=count,ncol=count)
for (i in 1:count) {
  for (k in 1:count) {
    Z[i,k]=g_x(c(X[i],Y[k]))
  }
}
Z[is.na(Z)] <- 1

# We open the 3d plot environment
open3d()
bg3d("white")
material3d(col = "black")
# Plot the function
persp3d(X,Y,Z,  col = "lightgrey",
        xlab = "X", ylab = "Y", zlab = "g(x)")




X1 <- seq(0, 1, length = 100)
Y1 <- X1
count2=length(X1)
Z1=matrix(nrow=count2,ncol=count2)
for (i in 1:count2) {
  for (k in 1:count2) {
           Z1[i,k]=predict(c(X1[i],Y1[k]),37,37)
           }
      }

bg3d("white")
nbcol = 100
color = rev(rainbow(nbcol, start = 0/6, end = 4/6))
zcol  = cut(Z1, nbcol)
persp3d(X1, Y1, Z1, col = color[zcol],
        xlab = "X", ylab = "Y", zlab = "prediction",add=T)


\newpage

Part III: Variable Selection & Cross-Validation

Correlation between Data

Let’s take a look at the correlation between the covariates:

X.cor <- cor(dat[,1:268])
mean(abs(X.cor))
[1] 0.5309039

…a very high average correlation, in particular between values observed at nearby frequencies (quite naturally). We are going to have some problems here…

In conclusion, the first six plots showed that NIR values at nearby frequencies are strongly related, and the last plot showed that the information carried by some frequencies is redundant, since the curves are often parallel: we should drop a lot of them.

#Fitting the model on the train test
m<-lm(y~.,data=pet.train)
#Summaries of the model
head(m$coefficients,28)
(Intercept)         X.1         X.2         X.3         X.4         X.5 
   549.8032   -332.3468   1004.7647  -2073.9778   4109.1016  -8304.2185 
        X.6         X.7         X.8         X.9        X.10        X.11 
 13202.7075 -18149.9587  29237.7435 -41178.8220  42563.9510 -35747.4546 
       X.12        X.13        X.14        X.15        X.16        X.17 
 19386.5545  -1308.2303  -9830.0209  14586.6114 -20267.4051  27405.1965 
       X.18        X.19        X.20        X.21        X.22        X.23 
-16962.1355  -1887.8544   4341.8303          NA          NA          NA 
       X.24        X.25        X.26        X.27 
         NA          NA          NA          NA 

We are encountering the small-\(n\), large-\(p\) problem: the number of parameters, i.e., the coefficients of the independent variables, is far larger than the sample size. This drains the residual degrees of freedom, so most estimates come back as NA, and so do the summary statistics (no F statistic, since the residual degrees of freedom would be negative).
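This NA pattern is exactly what `lm()` reports whenever the design matrix is rank-deficient; a tiny reproducible illustration on simulated data (not the NIR dataset):

```r
# With n = 10 observations and p = 15 predictors, lm() can estimate at
# most n coefficients (including the intercept); the rest come back NA.
set.seed(1)
n <- 10; p <- 15
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
m <- lm(y ~ X)
sum(is.na(coef(m)))    # 16 coefficients, rank 10 -> 6 are inestimable
```

The residual degrees of freedom are n minus the rank, here exactly zero, which is why no test statistics can be produced either.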

Step forward selection

Since the response and the covariates live on different scales, we decided to scale \(\mathbb{X}\):

######
ytr<-pet.train[ , 269] 
Xtr = pet.train[ , -c(269,270)] 
yte = pet.test[ , 269] 
Xte = pet.test[ , -c(269,270)]
#Standardising the independent variables
Ztr = scale(Xtr)
Zte = scale(Xte)
Define the forward-selection routine:
fwd.reg <- function(y, X, k.max = ncol(X), stand = TRUE){
  # Standardize if asked
  if (stand) X <- scale(X)
  # Initialize variable sets & other quantities
  S = NULL        # active set
  U = 1:ncol(X)   # inactive set
  k.loc = 0       # current active-set size
  ee = y          # current residuals
  # Loop
  while (k.loc < k.max){
    ## Update loop-index
    k.loc = k.loc + 1
    ## Step 1: Evaluate correlation with inactive variables
    rr = abs( cor(X[,U], ee) )
    ## Step 2: Extract the most correlated variable & update the sets
    J = U[which.max(rr)]
    S = c(S, J)
    U = U[-which.max(rr)]
    ## Step 3: Regress on active variables and get residuals
    ee = resid( lm(y ~ X[,S]) )
  }
  # Output
  return(S)
}

S<-fwd.reg(ytr,Xtr) 
S
 [1]  40 241   1 150  90  59  17  77  56  57  65   4  66  61  63   2  54
[18]  68  67  60

Now we have a ranking of the covariates most correlated with the response. At each step the algorithm selects the covariate most correlated with the current residuals, so each addition brings in information that can lower the error. However, each choice depends on the previous ones: we cannot claim this is the best model, only a locally best one. Another set of variables might do better, but how could we find it? At each step we only make the greedy move, the one that most reduces the residual error.

# Initialize the vectors of scores 
model.score <- matrix(NA, length(S),4)

#y hat
m=lm(ytr ~ Ztr)
yhat <- predict(m, data.frame(Ztr) ) 
train.error=ytr-yhat

library(MASS)
for (idx in 1:length(S)){
  
  
  # Build the data.frame with the FWD-selected variables
  ztr <- Ztr[, S[1:idx], drop = FALSE] 
  zte <- Zte[, S[1:idx], drop = FALSE] 
  xtr <- Xtr[, S[1:idx], drop = FALSE]
  xtr=data.matrix(xtr)
  
  ####Train-Split Schema
  
  m=lm(ytr ~ ztr)
  yhat <- predict(m, newdata=list(ztr=zte ) )
  MSE=mean((yte - yhat)^2)
  
  #### Leave One Out
  yhat <- predict(m, newdata=list(ztr=ztr ) )
  hat=function(x) diag(x%*%ginv(t(x)%*%x)%*%t(x))
  
  R.LOO=1/length(ytr)*sum(((ytr - yhat)/(1-hat(ztr)))^2)
  
  
  #### Generalized  Cross-Validation
  
  R.GCV = 1/length(ytr)*sum((ytr - yhat)^2)/((1 - (idx/length(ytr)))^2)
  
  # Cross-Validation (5-fold)
  nfold <- 5
  rarr <- sample(1:nrow(ztr))
  fold.str <- matrix(rarr, nrow = nfold, ncol = round(nrow(ztr)/nfold))
  # NB: 21 training observations, so this fold matrix discards one of them
  score = c()
  for (i in 1:nfold) {
    test.idx<-fold.str[i,]
    test.loc<-ztr[test.idx,]
    test.y=ytr[test.idx]
    train.idx = c(fold.str[-i,])
    train.loc = ztr[train.idx,]
    train.y=ytr[train.idx]
    m.loc = lm(train.y ~ train.loc)
    y.hat = predict(m.loc, newdata=list(train.loc=test.loc))
    score[i] = mean((y.hat - test.y)^2)
  }
  cv.score = mean(score)
  model.score[idx,] = c(MSE,cv.score,R.LOO,R.GCV)
}

                        Optimal number of cov.
Train-test split Scheme                      3
5-fold CV                                   12
Leave-1-out                                 20
Generalized CV                              20
              RMSE
k=3 Tr-Te 14.10791
k=12 5-CV 15.70449
k=20 LOO  15.04671
k=20 GCV  15.04671

In general we used the 20 best-correlated variables, so as to keep the residual degrees of freedom positive when running the regressions. The optimal number of variables then depends on which criterion we use to evaluate the model. In the end, the best model achieved with this algorithm is the one with X.40, X.241 and X.1. We cannot claim this is the best model overall: there were several inconsistencies in our procedure, for example estimating the parameters and computing the error on the same data.